Abstract

This paper reports Part II of a two-part effort that is intended to
delineate the relationship between reliability and fault tolerant
control in a quantitative manner. Reliability properties peculiar to
fault-tolerant control systems are emphasized, such as the presence of
analytic redundancy in high proportion, the dependence of failures on
control performance, and high risks associated with decisions in
redundancy management due to multiple sources of uncertainties and
sometimes large processing requirements. As a consequence, coverage of
failures through redundancy management can be severely limited. The
paper proposes to formulate the fault tolerant control problem as an
optimization problem that maximizes coverage of failures through
redundancy management. Coverage modeling is attempted in a way that
captures its dependence on the control performance and on the
diagnostic resolution. Under the proposed redundancy management
policy, it is shown that an enhanced overall system reliability can be
achieved with a control law of superior robustness, with an estimator
of higher resolution, and with a control performance requirement of
lesser stringency.


1 Introduction 

Highly reliable systems make use of redundancy to achieve fault
tolerance, due to limited reliability of components or subsystems.
Utilization of analytic redundancy [9], that provided by static and
dynamic relations among system variables, such as secondary functions
of effectors, virtual measurements, projections, etc., can further
reduce the probability of exhaustion of hardware in a cost-effective
manner. Analytic redundancy management of complex control systems,
however, involves considerably more risk in comparison with such
schemes as majority voting, for decision making is often based on
residual signals formed by the differences between noisy measurements
and calculated values of output variables based on inaccurate models.
Decision errors can be associated with uncertainties on whether there
is a subsystem failure, which subsystem has failed, how severe its
effect is, whether it is necessary to take a drastic corrective action,
and which action to take. In addition, the question may also arise on
whether there is adequate control relevant redundancy and authority to
allow recovery from the effect of the failure. The dynamic and
closed-loop nature, common to all control systems, is the source of
additional difficulties, such as temporary masking of the effect of
subsystem failures, the vagueness in the definition of a system level
failure in the context of control performance, and the sometimes
significant processing requirement in supporting the redundancy
management.

ing coverage of failures in fault tolerant control systems
alternatively under the probabilistic formalization (rather than under
the possibilistic formalization [19]), which allows the synergy with,
and the direct utilization of, the classical reliability theory

There are many applications in which fault tolerance may be achieved
by using one of the adaptive control, reliable control, or
reconfigurable control strategies. As the control action becomes
progressively more drastic, the likelihood of involving an explicit
diagnostic process becomes higher, and decision making becomes riskier.
Fault tolerant control in general is a subject too broad to be
discussed in this paper. Instead, the discussion here will be confined
to its relation to reliability, and our inquiry on the control strategy
will be kept at the conceptual level. The reader is referred to
[17, 4, 5, 16] and references therein for a more complete view of the
state of the art, issues, and methodologies in the field of fault
tolerant control.

Definitions suggested in [14] on fault and failure are adopted with a
slight modification. A fault is an unpermitted deviation of at least
one characteristic property or variable of the system. A failure is a
permanent interruption of a system’s ability to perform a required
function under specified operating conditions. Note that a failure can
also be defined at the subsystem level. A fault may or may not lead to
a failure. Without loss of generality, a subsystem failure is assumed
to always lead to the system failure unless a successful management of
redundancy ensues. Since this paper is concerned with closed-loop
control systems, for which the occurrence of a failure depends on
whether there is a loss of control performance, a properly defined
control performance threshold will be introduced to quantify the
acceptable system performance level. The threshold must encompass
requirements in stability, and in transient as well as steady state
control performance. It is assumed that no event or events would
trigger a sequence of infinite reconfiguration actions, in which case
a stability problem of a different nature must be considered. A system
level failure is declared when faults or subsystem failures cause the
control performance of the system to fall below the prescribed
threshold. The performance threshold can be set at two (or more)
different levels, each corresponding to a specific reliability
requirement. In aviation, for example, one level can be set by the
ability to carry out a normal mission (or mission abort in terms of
failure probability), and another can be set by the ability to merely
maintain the system stability needed for safe landing (loss of control
in terms of failure probability). This paper will treat different
reliability requirements in a unified manner.

Reliability is naturally a subjective concern in the analysis and
design of fault-tolerant control systems. Reliability is rarely
regarded as an objective criterion that guides a control system design
in an integrated manner. This predicament is due to the difficulty in
establishing a functional linkage between the overall system
reliability and the performance defined in the conventional sense at
the bottom level for controls and for diagnosis. The paper is organized
as follows. Section 2 models coverage in fault tolerant control
systems, and delineates two important roles coverage plays: one as a
criterion for off-line integrated fault tolerant control system design,
and one as a criterion for on-line minimum-risk redundancy management.
Finally, a functional linkage between reliability and control/diagnosis
performance is established. Section 3 summarizes the findings of the
paper.


2 Coverage in Fault Tolerant Control 

This section focuses on coverage modeling and evaluation, through
which reliability will be tied to the design and operation of fault
tolerant control systems. The previous section emphasizes statistical
analysis based on failure data, and attempts to infer from the sample
of failure data some representative behavior of the general
population. It is possible in that case to assume a range of coverage
values in assessing a system’s reliability and to determine the set of
minimum coverage values required for a given overall system reliability
requirement. The reliability issues are viewed from a different
perspective in this section. The concern now is with how to achieve the
set of required coverage values through proper designs of control and
diagnostic modules. Some of the basic ideas presented in this section
follow those presented in [19], where a possibilistic formalization is
used. It is the first time rigorous arguments are given under a clearly
defined redundancy management policy using a probabilistic
formalization to confirm our intuitions on how control and diagnosis
performance affect overall system reliability.



Fig. 1 Schematic of fault tolerant control system

Since the design of both feedback control and diagnostic algorithms
depends on the model of the plant to be controlled, the task of
modeling the individual physical processes for which specific
reliability goals are to be implemented must be tackled first. A model
suitable for the fault tolerant control purpose should reflect the
effects of failures and the availability of redundancy. Suppose all
such conditions enter the plant model in the form of parameters. The
effort of fault parameterization in linear parameter varying (LPV)
format, for example, is ongoing. A fault effect parameter space can
then be defined as the Euclidean space of all parameters that change
their values as the result of some fault occurrence. The prescribed
range of variation of such parameters forms a set in the parameter
space. Let $\theta$ denote a vector in the space of dimension $N$, and
$\Omega$ denote the set over which $\theta$ resides when a fault
occurs. Without loss of generality, $\Omega$ can be regarded as a
hyper-rectangle

$$\Omega = \{\theta \mid \theta_{l,\min} \le \theta_l \le \theta_{l,\max},\ l = 1, \ldots, N\},$$

with $\theta = 0$ denoting the no-fault parameter vector.
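
The hyper-rectangle above can be represented directly in code. The
following is a minimal sketch rather than anything taken from the
paper: the class name `FaultDomain`, its fields, and the two-parameter
bounds are hypothetical choices made only for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class FaultDomain:
    """Hyper-rectangle of admissible fault effect parameters (hypothetical helper)."""
    lower: Sequence[float]
    upper: Sequence[float]

    def contains(self, theta: Sequence[float]) -> bool:
        """True if every component of theta lies within its prescribed range."""
        return len(theta) == len(self.lower) and all(
            lo <= th <= hi for lo, th, hi in zip(self.lower, theta, self.upper)
        )


# Assumed example: two parameters, each a fractional loss of effectiveness in [0, 1].
omega = FaultDomain(lower=(0.0, 0.0), upper=(1.0, 1.0))
no_fault = (0.0, 0.0)  # theta = 0 is the no-fault parameter vector
```

A call such as `omega.contains((0.3, 0.75))` then checks whether a
candidate parameter vector lies in the fault effect domain.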

Next, control performance will be defined over the fault effect
domain. In the schematic diagram shown in Fig. 1, $G(\theta)$
represents a model for the input to output mapping of the plant,
including models of the actuators and the sensors. The argument
$\theta$ is made explicit to indicate that the model is dependent on
the fault effect parameter vector. Vector $w$ contains all external
signals, including disturbances, sensor noises and reference signals.
Controlled output $z$ is an error vector, capturing the design
specifications for the system; $y$ is the vector of measured variables;
$u$ is the vector of control inputs. $J$ has been suggested and
discussed as a measure of control performance in [19], and the
computability of this measure as a function of two control
effectiveness factors has been demonstrated.

Let $J_{\min}$ denote a prescribed control performance threshold to
distinguish the normal from a failed operation of the controlled
system, i.e., a failure is declared when

$$J_{U_i}(\theta) < J_{\min}, \qquad (1)$$

where subscript $U_i$ denotes a particular control setting. Whenever
(1) becomes the case, a control reconfiguration becomes necessary. The
essence of fault tolerant control lies with the management of the
control relevant redundancy. Depending on the severity of the anomaly,
management of control relevant redundancy can be carried out via
control law robustification, adaptation, or reconfiguration. As the
control action becomes progressively more drastic, the likelihood of
involving an explicit diagnostic process becomes higher, and decision
making becomes riskier. Since successful redundancy management depends
on the knowledge of fault effect parameter $\theta$, the challenge
facing us is to acquire, to represent, and to utilize the knowledge in
the presence of uncertainties. Fault tolerance can be achieved only if
sufficient redundant control authority exists in the system. The issue
regarding the adequacy of control relevant redundancy is elaborated in
[22]. The discussion on constraints that must be imposed when the set
$\{U_i\}$ is constructed can be found in [19].

Let us now extend the dimension of the fault effect parameter space by
one to form an $(N+1)$-dimensional parameter-performance
($\theta$-$J$) space, as depicted in Fig. 2. The horizontal plane in
this figure is an abstract representation of the space of fault effect
parameters. The distance between two fault effect parameter vectors is
measured by the Euclidean norm. The vertical axis represents
performance, a measure of how well the specified control objectives are
achieved. A larger value along this axis corresponds to better achieved
objectives.

A point $(\theta, J)$ in the $\theta$-$J$ space reflects the
consequence of using a particular control law. Its projection on the
horizontal plane specifies the corresponding fault parameter value; its
vertical axis value indicates the level of performance achieved. For a
given control law, a different fault effect parameter vector would
result in a different performance. Therefore, corresponding to each
control law $U_i$, there is a surface $J_{U_i}(\theta)$, as shown in
Fig. 2. Differently configured control laws produce different
performance surfaces. The flat surface defines the performance
threshold $J_{\min}$, corresponding to a set of minimum objectives. Any
point $(\theta, J)$ on the $i$th surface below this threshold
corresponds to an underperforming control law $U_i$.

Fig. 2 Graphical representation of a $\theta$-$J$ database, and its
interaction with a diagnostic outcome

Definition 1. Let

$$S_A = \{\theta \in \Omega \mid J_{U_A}(\theta) \ge J_{\min}\}, \quad S_B = \{\theta \in \Omega \mid J_{U_B}(\theta) \ge J_{\min}\}. \qquad (2)$$

With respect to any $\theta^* \in \Omega$, control law $U_A$ is said to
outperform control law $U_B$ if and only if $\theta^* \in S_A, S_B$,
and

$$S_A \supset S_B. \qquad (3)$$

Similarly, with respect to any set $\Theta \subset \Omega$, control law
$U_A$ is said to outperform control law $U_B$ if and only if
$\Theta \subset S_A, S_B$, and

$$S_A \supset S_B. \qquad (4)$$
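
Definition 1 can be checked numerically once the performance surfaces
are sampled on a grid. The sketch below is illustrative only: the two
performance functions, the threshold value, and the grid resolution are
assumptions, and the inclusion in (3) and (4) is read here as a proper
superset.

```python
from itertools import product

J_MIN = 0.5  # assumed performance threshold


def j_nominal(theta):
    """Hypothetical performance surface of a conventional control law."""
    return 1.0 - 1.2 * max(theta)


def j_robust(theta):
    """Hypothetical performance surface of a robustified control law."""
    return 0.9 - 0.45 * max(theta)


def acceptable_set(perf, grid, j_min=J_MIN):
    """Sampled version of S = {theta in Omega : J_U(theta) >= J_min}."""
    return {theta for theta in grid if perf(theta) >= j_min}


def outperforms(perf_a, perf_b, grid, region, j_min=J_MIN):
    """U_A outperforms U_B on `region` if region lies in both sets and S_A contains S_B."""
    s_a = acceptable_set(perf_a, grid, j_min)
    s_b = acceptable_set(perf_b, grid, j_min)
    # region inside S_B and S_B inside S_A together imply region inside S_A.
    return set(region) <= s_b and s_a > s_b


# Two-parameter fault effect domain [0, 1] x [0, 1], sampled in steps of 0.1.
AXIS = [i / 10 for i in range(11)]
GRID = list(product(AXIS, AXIS))
```

With these assumed surfaces, `outperforms(j_robust, j_nominal, GRID,
[(0.0, 0.0)])` holds while the reverse comparison does not.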

In fact, control laws that are robust and adaptive are designed to
outperform conventional control laws in the sense defined above. This
part of the study is conducted mostly during the off-line design phase,
and its thoroughness is judged by the extent of exploitation and
utilization of existing redundancy, and by the completeness of coverage
of the fault effect, in the sense that whenever possible there should
be at least one control law $U_i$ that is not underperforming
(underperforming meaning $J_{U_i}(\theta) < J_{\min}$) for every
$\theta \in \Omega$. The outcome of such an investigation is a
$\theta$-$J$ database, which is to be stored for on-line use. On-line
data can supplement the database for use with on-line redesign. Such a
database is apparently application specific.
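
One possible shape for such a database, together with the completeness
check described above, is sketched here. Everything in it is assumed
for illustration: a scalar fault parameter, two made-up performance
functions named `U1` and `U2`, and a dictionary keyed by the sampled
parameter value.

```python
J_MIN = 0.5  # assumed performance threshold

# Hypothetical performance surfaces over a scalar fault parameter theta in [0, 1].
CONTROL_LAWS = {
    "U1": lambda theta: 1.0 - 1.5 * theta,             # tuned for the no-fault case
    "U2": lambda theta: 0.8 - 0.8 * abs(theta - 0.6),  # reconfigured law
}


def build_database(laws, grid):
    """Tabulate J_U(theta) for every control law at every sampled theta."""
    return {theta: {name: perf(theta) for name, perf in laws.items()} for theta in grid}


def uncovered_points(database, j_min):
    """Sampled theta values at which every stored control law underperforms."""
    return [
        theta
        for theta, row in database.items()
        if all(j < j_min for j in row.values())
    ]


GRID = [i / 10 for i in range(11)]
database = build_database(CONTROL_LAWS, GRID)
gaps = uncovered_points(database, J_MIN)
```

In this made-up example only the most severe fault value is left
without an acceptable control law, which is the kind of gap the off-line
study is meant to expose.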

The field of diagnosis of dynamic systems has matured over the past
twenty years. Diagnosis no doubt plays a crucial role in fault tolerant
control. It provides information on the parameter values in the form of
an estimate $\hat{\theta}$. Since any $\hat{\theta}$ can be called an
estimate, a description of the uncertainty associated with an estimate
is needed for the estimate to be useful. First and second order
statistics of an estimate can provide a reasonably prompt and accurate
description of the uncertainty. They can be obtained through empirical
methods. An uncertainty description in the form of a probability
density function $f(\theta)$ is shown in Fig. 2. The spread of the
density function describes the resolution, or the performance, of the
diagnosis algorithm used. For a normal distribution with mean
$\theta^*$ and covariance $P$, a hyper-ellipsoid can be formed as

$$\mathcal{E} = \{\theta \mid (\theta - \theta^*)^T P^{-1} (\theta - \theta^*) \le \kappa\},$$

where $\kappa > 0$ defines the level of a constant probability density,
which in turn determines the size of the $N$-dimensional
hyper-ellipsoid. Since the volume of $\mathcal{E}$ is proportional to
$\sqrt{\det(\kappa P)}$, the quantity $R_\kappa$ (5) can be used as an
indicator of the resolution for estimate $\hat{\theta}$. Since $P$ is
usually a function of time, so is $R_\kappa$.
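
The dependence of the ellipsoid size on the covariance can be made
concrete for $N = 2$, where the region is an ellipse of area
$\pi\sqrt{\det(\kappa P)}$. The sketch below only evaluates this volume
factor; the covariance values and the scaling factor are illustrative
assumptions.

```python
import math


def det2(m):
    """Determinant of a 2x2 matrix given as nested lists."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def ellipse_area(p, kappa):
    """Area of {theta : (theta - theta*)^T P^{-1} (theta - theta*) <= kappa}, N = 2."""
    scaled = [[kappa * v for v in row] for row in p]
    return math.pi * math.sqrt(det2(scaled))


P_A = [[4.0, 0.0], [0.0, 1.0]]                  # assumed covariance of estimate A
CHI = 4.0                                       # assumed scaling, P_B = CHI * P_A
P_B = [[CHI * v for v in row] for row in P_A]

area_a = ellipse_area(P_A, kappa=1.0)           # semi-axes 2 and 1
area_b = ellipse_area(P_B, kappa=1.0)           # semi-axes 4 and 2
```

Scaling the covariance by a factor $\chi$ scales the area by
$\chi^{N/2}$, so the estimate with the smaller covariance confines
$\hat{\theta}$ to a smaller region at the same $\kappa$.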


Definition 2. With respect to the same failure scenario, diagnostic
algorithm A is said to outperform diagnostic algorithm B at time $t$ if
and only if

$$R_\kappa^A(t) > R_\kappa^B(t), \quad \forall \kappa > 0, \qquad (6)$$

where $R_\kappa$ is given by (5), and $P$ is the second order central
moment of $f(\theta)$.


Theorem 1. The optimal redundancy management policy is to select
control law $U_k$ that satisfies

$$c_{U_k}(t_c) = \max_i \{c_{U_i}(t_c)\}, \qquad (9)$$

where $t_c$ is the critical clearance time at which a corrective action
must be taken, and $c_{U_i}$ is given in (8).

Proof. An uncovered failure is associated with an exit state, which is
arrived at with transition rate $(1 - c_{U_i}(t))\lambda_j$ for all $i$
and some $j$, where $\lambda_j$ is the transition rate out of state $j$
to a non-exit state. This transition rate enters only into the equation
of exit state $j$ in a Markov model and none of the other equations,
i.e.,

$$p_{jD}(t + \delta t) = p_{jD}(t) + \delta t\,(1 - c_{U_i}(t))\lambda_j p_j(t)$$

$$\Rightarrow\ \dot{p}_{jD}(t) = (1 - c_{U_i}(t))\lambda_j p_j(t).$$

Selecting $c_{U_k}$ that satisfies (9) minimizes the rate of change of
the exit state probability. This probability contributes to the
aggregated system death state probability as one of possibly many
additive terms. Therefore this policy maximizes the overall system
reliability. Critical clearance time $t_c$ is used because it
represents the longest time affordable for information acquisition and
processing without jeopardizing the opportunity for performance
restoration. □
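
The selection rule (9) and its effect on the exit state rate can be
sketched in a few lines. The coverage values, the transition rate, and
the state probability below are hypothetical numbers chosen only to
exercise the rule.

```python
def select_control_law(coverages):
    """Return the control law whose coverage at t_c is largest, as in (9)."""
    return max(coverages, key=coverages.get)


def exit_rate(coverage, lam_j, p_j):
    """Rate of change of the exit state probability, (1 - c) * lambda_j * p_j."""
    return (1.0 - coverage) * lam_j * p_j


# Hypothetical coverage values evaluated at the critical clearance time t_c.
coverage_at_tc = {"U1": 0.62, "U2": 0.91, "U3": 0.78}
LAMBDA_J = 1.0e-4  # assumed transition rate out of state j, per hour
P_J = 0.02         # assumed probability of occupying state j

chosen = select_control_law(coverage_at_tc)
rates = {name: exit_rate(c, LAMBDA_J, P_J) for name, c in coverage_at_tc.items()}
```

The chosen law is the one with the smallest entry in `rates`, which is
the quantity the proof argues should be minimized.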


For a given algorithm, $R_\kappa(t)$ is generally an increasing
function of $t$, which reflects the speed/accuracy tradeoff commonly
displayed. Due to the finite rate in data acquisition and processing,
it is conceivable that a system with faster dynamics is more
susceptible to the consequences of uncertainties at the time of
decision making.

We now define the notion of coverage. Coverage has been used as a
parameter to reflect the ability of a system to automatically recover
from the occurrence of a fault during normal system operation:

Coverage = Probability (System recovers | Fault occurs).

At a given time, there is one coverage value associated with each
control law in use.


Definition 3. Denote the coverage associated with using control law
$U_i$ by $c_{U_i}(t)$. Denote the covered domain over which control law
$U_i$ provides an acceptable performance by $\Omega_i$, i.e.,

$$\Omega_i = \{\theta \in \Omega \mid J_{U_i}(\theta) \ge J_{\min}\}. \qquad (7)$$

Then

$$c_{U_i}(t) = \int_{\Omega_i} f(\theta, t)\, d\theta. \qquad (8)$$

The above integral should be understood as a multi-variable integral.
It represents the probability that estimate $\hat{\theta}$ resides
within the set $\Omega_i$ over which $U_i$ restores the system
operation.
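
For a scalar fault parameter the integral in (8) can be approximated
with a simple midpoint rule. The sketch below assumes a Gaussian
density for the estimate and a linear, made-up performance function;
the function names and all numerical values are illustrative, and the
density is not renormalized over $\Omega$ because the mass outside it
is negligible for these values.

```python
import math


def gaussian_pdf(theta, mean, std):
    """Density of a scalar normal distribution."""
    z = (theta - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def coverage(perf, density, j_min, lo, hi, steps=20000):
    """Midpoint-rule estimate of the integral of `density` over the covered domain.

    The covered domain is {theta in [lo, hi] : perf(theta) >= j_min}.
    """
    width = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        theta = lo + (k + 0.5) * width
        if perf(theta) >= j_min:
            total += density(theta) * width
    return total


# Hypothetical law: performance degrades linearly, acceptable for theta <= 0.5.
perf_u1 = lambda theta: 1.0 - theta
f_hat = lambda theta: gaussian_pdf(theta, mean=0.4, std=0.1)

c_u1 = coverage(perf_u1, f_hat, j_min=0.5, lo=0.0, hi=1.0)
```

Here the estimate is centered one standard deviation inside the edge of
the covered domain, so the computed coverage is roughly 0.84.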


Coverage is a dynamic quantity because $f(\theta, t)$ changes with
time. Typically, as more time is allowed to collect and process
measured data, the resolution, as defined in (5), of the estimate
increases and the value of coverage increases toward 1 as well, which
is consistent with the classical speed/accuracy tradeoff. All the
values of coverage used in Part I of the paper are static coverage
values. It is conceivable that in one application, a system may be able
to afford to wait until some time $t_c$ when a prescribed static
coverage value has been reached to make a decision for redundancy
management, while in another application, critical time $t_c$ is
prescribed and a decision made at $t_c$ must carry a high risk with the
achieved static coverage.
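
The two situations just described can be illustrated with a toy model
in which the spread of the estimate shrinks as data accumulate. The
decay law for the standard deviation, the covered interval, and every
numerical value below are assumptions made purely for illustration;
they are not taken from the paper.

```python
import math


def normal_cdf(x, mean, std):
    """Cumulative distribution function of a scalar normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def coverage_at(t, lo=0.0, hi=0.5, mean=0.4, std0=0.3, tau=0.2):
    """Coverage of the interval [lo, hi] at time t.

    The decay law std(t) = std0 / sqrt(1 + t / tau) is an assumed stand-in
    for an estimator whose resolution improves with time.
    """
    std = std0 / math.sqrt(1.0 + t / tau)
    return normal_cdf(hi, mean, std) - normal_cdf(lo, mean, std)


def first_time_reaching(target, t_max=5.0, dt=0.01):
    """Earliest sampled time at which coverage reaches `target`, or None."""
    t = 0.0
    while t <= t_max:
        if coverage_at(t) >= target:
            return t
        t += dt
    return None


t_wait = first_time_reaching(0.9)   # case 1: wait until a prescribed coverage is reached
c_forced = coverage_at(1.0)         # case 2: decide at a prescribed critical time t_c = 1.0
```

In the first case the decision time is dictated by the coverage target;
in the second the coverage is whatever has been achieved by the
prescribed time, here noticeably below the target used in the first
case.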

It appears that the definition and the calculation of coverage must
involve an explicit estimation algorithm. This is in fact not the case.
Even for reliable control [8], where no estimation algorithm is
involved at all, as long as there is a probabilistic description of the
fault-effect parameter estimate (a uniform distribution over $\Omega$
in the case of total ignorance), and a database on control performance
has been established, coverage is well defined.

From Definition 3 it can be seen that coverage is related to the
performance of each individual control law $J_{U_i}$, to the system
performance threshold $J_{\min}$, and to the diagnostic resolution
determined by $f(\theta)$. Therefore, in addition to its role as the
criterion for optimal redundancy management, coverage also plays a
crucial role in providing guidelines for integrated designs of fault
tolerant control systems. Since the above mentioned relation is
uniquely and explicitly defined, the design guidelines are unambiguous
and design results are measurable.

In the statement of the next three theorems, variable $t$ is suppressed
for simplicity. This can also be regarded as confining our interest to
only static coverage for some prescribed critical clearance time $t_c$.

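Theorem 2. Given a control performance threshold $J_{\min}$, and an
estimate $\hat{\theta} \in \Omega$ with a fixed distribution
$f(\theta)$. Suppose $U_A$ and $U_B$ are two candidate control laws.
Then $c_{U_A} \ge c_{U_B}$ if $U_A$ outperforms $U_B$.

Proof. Since $U_A$ outperforms $U_B$, it follows by Definition 1 that

$$\Omega_A = \{\theta \in \Omega \mid J_{U_A}(\theta) \ge J_{\min}\} \supset \Omega_B = \{\theta \in \Omega \mid J_{U_B}(\theta) \ge J_{\min}\}.$$

Then by Definition 3

$$c_{U_A} - c_{U_B} = \int_{\Omega_A} f(\theta)\,d\theta - \int_{\Omega_B} f(\theta)\,d\theta = \int_{\Omega_A \setminus \Omega_B} f(\theta)\,d\theta \ge 0.$$

Therefore $c_{U_A} \ge c_{U_B}$. □

Based on Theorem 2, a robustified control law that has achieved an
expansion of the covered domain leads to a higher coverage. Similarly,
if an expansion of the covered domain has been achieved by making the
control law adaptive, a higher coverage can be attained.

Theorem 3. Given a control law $U_i$, and an estimate
$\hat{\theta} \in \Omega$ with a fixed distribution $f(\theta)$.
$J_{U_i}(\theta)$ is assumed to be a single valued function over
$\Omega$. Suppose $J_{\min}^A$ and $J_{\min}^B$ are two control
performance thresholds. Then $c_{U_i}^A \ge c_{U_i}^B$ if
$J_{\min}^A < J_{\min}^B$.

Proof. Since $J_{U_i}$ is single valued over $\Omega$, its
$\alpha$-cuts

$${}^{\alpha}J_{U_i}(\theta) = \{\theta \in \Omega \mid J_{U_i}(\theta) \ge \alpha\}, \quad \alpha > 0,$$

are nested: the cut at a lower level contains the cut at a higher
level. Hence the covered domain under $J_{\min}^A$ contains the covered
domain under $J_{\min}^B$, and by Definition 3 the difference
$c_{U_i}^A - c_{U_i}^B$ is the integral of $f(\theta)$ over the set
difference of the two domains, which is non-negative. □

Based on Theorem 3, one can expect a lower coverage for a system with a
more stringent control performance requirement.

Theorem 4. Given a control law $U_i$, a control performance threshold
$J_{\min}$, and two estimates with distributions $N(\theta^*, P_A)$ and
$N(\theta^*, P_B)$, respectively, where

$$\theta^* \in \Omega_i = \{\theta \in \Omega \mid J_{U_i}(\theta) \ge J_{\min}\},$$

and $P_B = \chi P_A$, $\chi > 1$. In addition, assume
${}^{\alpha}J_{U_i}(\theta)$ is convex for all $\alpha > 0$. Then
$c_{U_i}^A \ge c_{U_i}^B$.

Proof. Since $P_B = \chi P_A$, it follows from (5) that

$$R_\kappa^A = \chi^N R_\kappa^B > R_\kappa^B, \quad \forall \kappa > 0.$$

Therefore diagnostic algorithm A outperforms algorithm B. Let
${}^{\alpha}f_A(\theta)$ and ${}^{\alpha}f_B(\theta)$ be the
$\alpha$-cuts of the two distributions over $R^N$, where $\alpha > 0$.
Let $\alpha_0$ be such that
${}^{\alpha_0}f_A(\theta) = {}^{\alpha_0}f_B(\theta)$, and
$\alpha_{\max} = f_A(\theta^*) = 1/\sqrt{(2\pi)^N \det(P_A)}$. The fact

$$\int_0^{\alpha_{\max}} |{}^{\alpha}f_A(\theta)|\,d\alpha = \int_0^{\alpha_{\max}} |{}^{\alpha}f_B(\theta)|\,d\alpha = 1$$

yields

$$\int_0^{\alpha_0} \left(|{}^{\alpha}f_A(\theta)| - |{}^{\alpha}f_B(\theta)|\right) d\alpha + \int_{\alpha_0}^{\alpha_{\max}} \left(|{}^{\alpha}f_A(\theta)| - |{}^{\alpha}f_B(\theta)|\right) d\alpha = 0. \qquad (10)$$

$P_A < P_B$ implies that the first integrand is non-positive for each
$\theta$ and the second is non-negative for each $\theta$. This still
holds after restricting the $\alpha$-cuts of the two distributions to
the convex $J_{\min}$-cut of $J_{U_i}(\theta)$. When the restriction to
the $J_{\min}$-cut of $J_{U_i}(\theta)$ affects only the first term,
the first term becomes less negative, and (10) becomes positive. In
this case,

$$c_{U_i}^A - c_{U_i}^B = \int_{\Omega_i} [f_A(\theta) - f_B(\theta)]\,d\theta = \int_0^{\alpha_0} \left[|{}^{\alpha}f_A(\theta)| - |{}^{\alpha}f_B(\theta)|\right] d\alpha + \int_{\alpha_0}^{\alpha_{\max}} \left[|{}^{\alpha}f_A(\theta)| - |{}^{\alpha}f_B(\theta)|\right] d\alpha > 0, \quad \theta \in \Omega_i,$$

and therefore, $c_{U_i}^A > c_{U_i}^B$. For every $\theta$ at which the
restriction to the $J_{\min}$-cut of $J_{U_i}(\theta)$ affects the
second term in (10), the first integrand vanishes because of the nested
structure of the $\alpha$-cuts. Since $\theta^* \in \Omega_i \subset$
$J_{\min}$-cut of $J_{U_i}(\theta)$, there will always be some set
around $\theta^*$ over which the second term is nonzero. Therefore (10)
remains non-negative, and $c_{U_i}^A \ge c_{U_i}^B$. □

Fig. 3 Nested $\alpha$-cuts of $f(\theta)$ and $J_{\min}$-cut of
$J_{U_i}$ in the fault effect parameter space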



Theorem 4 shows for a special case that a higher reso- 
lution leads to a higher system reliability with coverage 
defined in (8). It is possible to extend this result with 
relaxed assumptions on the distributions of the two es- 
timates. 
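
The three monotonicity results can be checked numerically in a scalar
setting where coverage has a closed form. The sketch below is an
illustration under stated assumptions, not the authors' example: the
performance function $J(\theta) = 1 - s\,\theta$ with slope $s$, the
Gaussian estimate, and all numerical values are hypothetical.

```python
import math


def normal_cdf(x, mean, std):
    """Cumulative distribution function of a scalar normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def coverage(j_min, slope, mean, std, lo=0.0, hi=1.0):
    """Coverage for an assumed law with J(theta) = 1 - slope * theta on [lo, hi].

    The covered domain {theta : J(theta) >= j_min} is the interval
    [lo, min(hi, (1 - j_min) / slope)], which is convex.
    """
    edge = min(hi, (1.0 - j_min) / slope)
    if edge <= lo:
        return 0.0
    return normal_cdf(edge, mean, std) - normal_cdf(lo, mean, std)


base = coverage(j_min=0.5, slope=1.0, mean=0.3, std=0.1)

# Theorem 2: a more robust law (flatter surface) enlarges the covered domain.
more_robust = coverage(j_min=0.5, slope=0.7, mean=0.3, std=0.1)

# Theorem 3: a less stringent threshold enlarges the covered domain.
laxer_threshold = coverage(j_min=0.4, slope=1.0, mean=0.3, std=0.1)

# Theorem 4: same mean inside the covered domain, covariance reduced by a factor of 4.
sharper_estimate = coverage(j_min=0.5, slope=1.0, mean=0.3, std=0.05)
```

Each of the three variations yields a coverage above `base`, in line
with the direction stated by the corresponding theorem.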

Due to the page limit, we will not be able to present any examples
here. The reader is referred to [20] for an example of a small scale,
proof-of-concept fault tolerant flight control system design where the
bounds for coverage of a 75% loss of effectiveness of a control
effector are plotted against local time (starting at the onset of the
loss of effectiveness), and the coverage at the critical clearance time
is used for a control switch decision based on the minimum risk policy
of Theorem 1.


3 Conclusions 

The main contributions of the paper are presented in 
Definition 3, Theorems 1 through 4 and two corollaries 
in Section 2. 

Theorem 1 establishes that maximizing the coverage of 
the form expressed in (8) optimizes the reliability of 
a given fault tolerant control system. Theorems 2, 3, 
and 4 establish that the robustification of a control law, 
relaxation of control performance requirement, and en- 
hancement of diagnostic resolution help improve sys- 
tem reliability. 

It is recognized that both field and test data, which are crucial to
reliability studies but sensitive from market-competition and liability
viewpoints, are difficult and expensive to obtain, while published
accident data alone are not sufficient. Given the situation, new
reliability measures and assessment tools that can provide more
accurate information under less stringent data requirements are yet to
be defined and developed.